Chapter 6, exercise 1.
(a) best subset
(b) Can't say in general; it could be any of them, depending on the data.
(c) i.-ii. True iii.-v. False
Code is hidden here in Rmd.
Set up the data so that y is a linear function of x1 and of no other variable.
library(tidyverse)
# Generate training data: y depends on x1 only
set.seed(20190428)
tr <- matrix(rnorm(20*100), ncol=100)
colnames(tr) <- paste0("x", 1:100)
tr <- scale(tr)
tr <- as_tibble(tr) %>% mutate(y = 2*x1 + rnorm(20)*0.4)
# Generate test data
ts <- matrix(rnorm(10*100),ncol=100)
colnames(ts) <- paste0("x", 1:100)
ts <- scale(ts)
ts <- as_tibble(ts) %>% mutate(y = 2*x1 + rnorm(10)*0.4)
Now fit a linear model, compute the MSE (or deviance), and plot the observed against the fitted values.
library(broom)
library(ggplot2)
# Fit models with an increasing number of predictors,
# and track the training and test error
for (i in 2:18) {
  tr_lm <- lm(y ~ ., data = tr[, c(1:i, 101)])
  tr_p <- augment(tr_lm, tr[, c(1:i, 101)])
  ts_p <- augment(tr_lm, newdata = ts[, c(1:i, 101)])
  tr_err <- round(sum(tr_p$.resid^2), 2)
  ts_err <- round(sum((ts_p$y - ts_p$.fitted)^2), 2)
  print(
    ggplot(data = tr_p, aes(x = .fitted, y = y)) +
      geom_point(size = 5, alpha = 0.5) +
      ylab("y") + xlab("fitted") +
      xlim(c(-10, 10)) + ylim(c(-10, 10)) +
      geom_point(data = ts_p,
                 shape = 2, size = 5, colour = "red") +
      ggtitle(paste0("p = ", i, " train = ", tr_err, " test = ", ts_err))
  )
}
# Examine the coefficients as p increases; the loop runs to p = 19,
# so b0..b19 (20 terms) covers every fitted coefficient
coefs <- tibble(term = factor(c("b0", paste0("b", 1:19)),
                              levels = c("b0", paste0("b", 1:19))),
                estimate = rep(0, 20))
for (i in 2:19) {
  tr_lm <- lm(y ~ ., data = tr[, c(1:i, 101)])
  tr_coef <- tidy(tr_lm)
  coefs$estimate <- rep(0, 20)
  coefs$estimate[1:(i+1)] <- abs(tr_coef$estimate)
  print(
    ggplot(data = coefs, aes(x = term, y = estimate)) +
      geom_col() + xlab("Variable") +
      ylab("Absolute coefficient") +
      ggtitle(paste0("p = ", i))
  )
}
As the number of predictors approaches the number of observations, the coefficients simply explode to huge values.
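A minimal self-contained sketch of the same phenomenon (fresh simulated data with a hypothetical seed, not the training set above): as p approaches n, the least-squares estimates become wildly unstable.

```r
set.seed(1)
n <- 20
X <- matrix(rnorm(n * 100), ncol = 100)
y <- 2 * X[, 1] + rnorm(n) * 0.4

# Largest absolute slope estimate as p grows towards n
for (p in c(2, 10, 18)) {
  fit <- lm(y ~ X[, 1:p, drop = FALSE])
  cat("p =", p, " max |beta| =",
      round(max(abs(coef(fit)[-1])), 2), "\n")
}
```

With p = 18 and n = 20 only one residual degree of freedom remains, so the fit essentially interpolates the data.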
This exercise investigates neural network model fitting. A neural network was fitted to the wiggle.csv data using the nnet package in R, from many random starts, using 2-4 nodes in the hidden layer. The best model is in the data object nnet_best.rda, and the full set of model fits is stored in nnet_many.rda. We are going to investigate the best fit, and the complete set of fits. The data set is actually a test set: it was not used to fit the model, but it was simulated from the same process as the training set, which we don't have.
The components of the best model object are hidden, output and nnet. The best model uses \(s=4\). The nnet component has the estimated model coefficients, fitted values and residuals. The hidden component has information related to the models at the 4 nodes of the hidden layer, and the output component has the same information for the second layer. These latter two contain a grid of values for the predictors, \(x\) and \(y\), and the predicted value at each grid point. The model does amazingly well at predicting this data.
We can see that each node captures one linear aspect of the nonlinear boundary.
The parameters are in the wgts element of the nnet component. With \(p=2\) inputs and \(s=4\) hidden nodes, each hidden node has \(2+1=3\) weights, giving \(3\times 4=12\) in the first layer. With \(s=4\) plus an intercept feeding two output levels, there are \(5\times 2=10\) in the second layer. In total there are 22 parameters.
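The arithmetic can be checked directly; a quick sketch of the parameter count for a single-hidden-layer network:

```r
# (p + 1) weights into each of the s hidden nodes, then
# (s + 1) weights into each of the k output units
p <- 2; s <- 4; k <- 2
n_hidden <- (p + 1) * s   # 3 x 4 = 12
n_output <- (s + 1) * k   # 5 x 2 = 10
n_hidden + n_output       # 22 parameters in total
```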
The coefficients are in the wgts element of the nnet component; there are 6 sets of linear model coefficients (4 hidden nodes plus 2 output units). For the first hidden node the coefficients are:
# [1] -19.01764 18.21640 77.10483
\[s_1 = 1/(1+e^{-(-19.01764+18.21640x+77.10483y)})\]
(e) OPTIONAL ADVANCED: See if you can compute the combination of the predictions on each hidden node to get the final prediction.
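A sketch of that computation, using the first hidden node's weights reported above. The output-layer weights v below are hypothetical placeholders, not fitted values: the real ones are in the wgts element of the nnet component.

```r
# Logistic activation of hidden node 1 at a point (x, y),
# using the coefficients reported above
w <- c(-19.01764, 18.21640, 77.10483)  # (intercept, x, y)
s1 <- function(x, y) plogis(w[1] + w[2] * x + w[3] * y)
s1(0.5, 0.2)

# Forward pass through the second layer. v is a HYPOTHETICAL
# placeholder: an intercept plus one weight per hidden node.
v <- c(-1, 2, 2, 2, 2)
combine <- function(h) plogis(v[1] + sum(v[-1] * h))
combine(c(s1(0.5, 0.2), 0.1, 0.9, 0.3))
```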
The nnet function has its own measure of goodness of fit, used to decide when to stop minimising the RSS; it is called value in this data. Plot the predictive accuracy against the function's returned value of model fit, and explain how the change in \(s\) affects the predictive accuracy. The best performance is achieved by the 4-node models. The predictive accuracy matches the fitting criterion fairly closely. The biggest feature to note, though, is that there is a lot of variability across models: a different random start can generate a very poor model. It takes some work to find the best model, but it can be a very good model.